Comparison of Short Read De Novo Alignment Algorithms
Abstract
The objective of this paper is to survey the algorithms used for de novo alignment of short-read data. Because the quality of the aligned sequence bases matters, the paper starts by comparing conventional sequencing methods with next-generation sequencing platforms. Next-generation sequencing poses new challenges to the bioinformatics community. Several de novo alignment algorithms are then described, followed by a discussion of how their approaches differ and whether the programs mitigate the disadvantages of next-generation sequencing technology. Finally, the paper describes a suggested implementation of a de novo alignment algorithm that builds on the successful principles of the short-read de novo aligners surveyed.

Conventional Sequencing

There are three main approaches to conventional sequencing. Some are similar to each other and some are very different, but the output of all of them shares certain properties.

Hierarchical Sequencing [1]

The first sequencing method is called “hierarchical sequencing,” and it was the method of choice for the Human Genome Project. Hierarchical sequencing involves cutting genomic DNA into ~150 kb pieces and inserting them into BAC vectors. These BAC vectors are then transformed into E. coli, replicated, and stored. The BAC inserts are then isolated, and each 150 kb fragment is mapped and ordered (the “golden tiling path”). The BACs on the golden tiling path are then randomly sheared into even smaller pieces. Each piece is cloned into a plasmid and sequenced on both strands. Contigs are built from the sequence data and the genome is then assembled (with about 8x coverage).

Shotgun Sequencing [1]

The second sequencing method is called “shotgun sequencing.” This method was used by Celera in their effort to sequence and assemble the human genome.
This method works well for small genomes (such as those of prokaryotes) that do not contain many repetitive sequences. The shotgun approach skips the BACs and goes straight to shearing the DNA into random fragments and cloning them into plasmids (for both strands). From here, contigs are assembled and aligned. Omitting the BAC step makes this method more prone to errors: because the chromosomal location of each BAC is known, the hierarchical method has far fewer truly random pieces to assemble at once. “For example, if a 500 kb portion of a chromosome is duplicated and each duplication is cut into 2kb fragments, then it would be difficult to determine where a particular 2 kb piece should be located in the finished sequence since it occurs twice. You might think, 'who cares since they're duplicates?' But duplications seldom retain their original sequences; they tend to drift over time. So a small region may be retained while other parts may mutate. This might create overlapping sequences for small pieces that are located several hundred kb apart on the chromosome” [1]. A further drawback of a pure shotgun approach appears to be false-positive alignments, in which two portions of the genome that actually lie hundreds of base pairs apart are concatenated.

Sanger Sequencing [6]

Sanger sequencing is another sequencing method. In addition to normal nucleotides, it uses a special type of nucleotide, the dideoxynucleotide (ddNTP). These nucleotides have a hydrogen atom where the hydroxyl group should be at the 3' position. This prevents a phosphodiester bond from forming with the next nucleotide and terminates the DNA chain. First, the DNA is denatured into two separate strands with heat, and a primer is then annealed to one of the template strands. Primers can be designed to bind to specific parts of the template strand, which makes it possible to sequence a region of interest.
The primer or the nucleotide is fluorescently labeled so that the fragments can be identified on a gel. The reaction is divided into four tubes, one each for A, C, T, and G, and each tube receives all four normal nucleotides along with its corresponding ddNTP. Because the ddNTPs are incorporated at random, the resulting fragments have different lengths, but they all share the same starting position. After this process, the DNA is denatured and run on a gel. Sanger sequencing produces reads up to 1000 bases long, usually with about 10x coverage.

Synopsis of Conventional Sequencing Methods

In a nutshell, the hierarchical approach is more time-consuming and expensive than the shotgun approach. However, its sequence data is less prone to producing inaccurate assemblies. Sanger sequencing does not produce reads as long as those of hierarchical or shotgun sequencing. Hierarchical and shotgun sequencing have about 8x coverage and Sanger sequencing about 10x, so the methods do not differ much in coverage.

Next-generation Sequencing Methods

More recently, many other methods have been developed to generate sequence data. In many ways, they are more advanced than the classical sequencing methods described above. Rather than outlining step by step how each platform works, as was done for conventional sequencing, only the essential properties of these platforms are covered. All of the methods below are considered next-generation sequencing platforms.

Illumina's HiSeq 2000 [7]

Illumina's HiSeq 2000 machine can produce 100 bp reads with about 30x coverage and a relatively low error rate. Illumina technology is a form of fluorescently labeled sequencing technology. On the low end, it can generate up to 35 GB of data with reads of 35 bases in length. On the high end, it can generate up to 200 GB of data with reads of 100 bases in length.
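
The coverage figures quoted for these methods can be related to read count and read length through the usual definition of average depth, coverage = (number of reads × read length) / genome size. The sketch below is only an illustration and is not taken from the paper: the function names and the 3.2-billion-base genome size are assumptions chosen for the example.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Assumed genome size (roughly human-sized); not a value from the paper.
GENOME_SIZE = 3_200_000_000

n = reads_needed(30, 100, GENOME_SIZE)
print(n)                                   # 960000000
print(mean_coverage(n, 100, GENOME_SIZE))  # 30.0
```

Under these assumptions, reaching 30x with 100-base reads takes on the order of a billion reads, which gives a sense of the data volumes that short-read aligners have to handle.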
The HiSeq is also capable of multiplexing samples. This expands the types of experiments that can be done with the instrument and potentially adds complexity to the sequence data.

Roche's 454 [8] [20]

Roche's 454 machine has longer read lengths than the other platforms, which mitigates the difficulty of mapping repetitive regions. However, these longer reads come with high error rates (especially in homopolymer repeats). 454 has been used for de novo alignment of bacterial and insect genomes. The 454 instrument is a parallelized version of pyrosequencing technology. The instrument generates about 0.5 GB of data with read lengths of about 400 bases.

Life Technologies' SOLiD 4 [9]

Life Technologies' SOLiD can produce reads of between 35 and 50 bases. SOLiD uses two-base encoding, which provides a form of error correction, and sequences by ligation. The instrument generates about 35 GB of data on the low end with reads of about 35 bases in length. On the high end, it can generate about 100 GB of data with reads of up to 50 bases in length. SOLiD provides about 15x coverage. SOLiD can also multiplex data (up to 1,536 per run).
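
To make the two-base encoding idea concrete, here is a minimal sketch. It assumes the commonly described color scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); that scheme and all function names are assumptions for illustration, not details given in the paper. The sketch shows why the encoding helps with error detection: a real single-base change alters two adjacent colors, so an isolated color change is more likely a measurement error.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Return (first_base, colors); each color spans two adjacent bases."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)


def color_diff(colors_a, colors_b):
    """Indices at which two equal-length color lists disagree."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


first, ref_colors = encode("ATGGCA")
_, snp_colors = encode("ATGTCA")  # same read with one base changed (G -> T)

print(ref_colors)                          # [3, 1, 0, 3, 1]
print(decode(first, ref_colors))           # ATGGCA
print(color_diff(ref_colors, snp_colors))  # [2, 3]
```

In this scheme a single miscalled color, if decoded naively, would change every base downstream of it, which is one reason color-space reads are usually compared in color space rather than converted to bases first.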